Collaborating Authors

human programmer


A systematic evaluation of large language models for generating programming code

Hou, Wenpin, Ji, Zhicheng

arXiv.org Artificial Intelligence

We systematically evaluated the performance of seven large language models in generating programming code using various prompt strategies, programming languages, and task difficulties. GPT-4 substantially outperforms other large language models, including Gemini Ultra and Claude 2. The coding performance of GPT-4 varies considerably with different prompt strategies. In most LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4 employing the optimal prompt strategy outperforms 85 percent of human participants. Additionally, GPT-4 demonstrates strong capabilities in translating code between different programming languages and in learning from past errors. The computational efficiency of the code generated by GPT-4 is comparable to that of human programmers. These results suggest that GPT-4 has the potential to serve as a reliable assistant in programming code generation and software development.


LeafAI: query generator for clinical cohort discovery rivaling a human programmer

Dobbins, Nicholas J, Han, Bin, Zhou, Weipeng, Lan, Kristine, Kim, H. Nina, Harrington, Robert, Uzuner, Ozlem, Yetisgen, Meliha

arXiv.org Artificial Intelligence

Objective: Identifying study-eligible patients within clinical databases is a critical step in clinical research. However, accurate query design typically requires extensive technical and biomedical expertise. We sought to create a system capable of generating data model-agnostic queries while also providing novel logical reasoning capabilities for complex clinical trial eligibility criteria. Materials and Methods: The task of query creation from eligibility criteria requires solving several text-processing problems, including named entity recognition and relation extraction, sequence-to-sequence transformation, normalization, and reasoning. We incorporated hybrid deep learning and rule-based modules for these, as well as a knowledge base of the Unified Medical Language System (UMLS) and linked ontologies. To enable data-model agnostic query creation, we introduce a novel method for tagging database schema elements using UMLS concepts. To evaluate our system, called LeafAI, we compared the capability of LeafAI to a human database programmer to identify patients who had been enrolled in 8 clinical trials conducted at our institution. We measured performance by the number of actual enrolled patients matched by generated queries. Results: LeafAI matched a mean 43% of enrolled patients with 27,225 eligible across 8 clinical trials, compared to 27% matched and 14,587 eligible in queries by a human database programmer. The human programmer spent 26 total hours crafting queries compared to several minutes by LeafAI. Conclusions: Our work contributes a state-of-the-art data model-agnostic query generation system capable of conditional reasoning using a knowledge base. We demonstrate that LeafAI can rival an experienced human programmer in finding patients eligible for clinical trials.


Is Model Attention Aligned with Human Attention? An Empirical Study on Large Language Models for Code Generation

Kou, Bonan, Chen, Shengmai, Wang, Zhijie, Ma, Lei, Zhang, Tianyi

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have been shown to be effective for code generation. Due to the complexity and opacity of LLMs, little is known about how these models generate code. To deepen our understanding, we investigate whether LLMs attend to the same parts of a natural language description as human programmers during code generation. An analysis of five LLMs on a popular benchmark, HumanEval, revealed a consistent misalignment between LLMs' and programmers' attention. Furthermore, we found that there is no correlation between the code generation accuracy of LLMs and their alignment with human programmers. Through a quantitative experiment and a user study, we confirmed that, among twelve different attention computation methods, attention computed by the perturbation-based method is most aligned with human attention and is consistently favored by human programmers. Our findings highlight the need for human-aligned LLMs for better interpretability and programmer trust.


Evaluating GPT's Programming Capability through CodeWars' Katas

Zhang, Zizhuo, Wen, Lian, Zhang, Shaoyang, Chen, David, Jiang, Yanfei

arXiv.org Artificial Intelligence

In the burgeoning field of artificial intelligence (AI), understanding the capabilities and limitations of programming-oriented models is crucial. This paper presents a novel evaluation of the programming proficiency of Generative Pretrained Transformer (GPT) models, specifically GPT-3.5 and GPT-4, against coding problems of varying difficulty levels drawn from Codewars. The experiments reveal a distinct boundary at the 3kyu level, beyond which these GPT models struggle to provide solutions. These findings led to the proposal of a measure for coding problem complexity that incorporates both problem difficulty and the time required for solution. The research emphasizes the need for validation and creative thinking capabilities in AI models to better emulate human problem-solving techniques. Future work aims to refine this proposed complexity measure, enhance AI models with these suggested capabilities, and develop an objective measure for programming problem difficulty. The results of this research offer invaluable insights for improving AI programming capabilities and advancing the frontier of AI problem-solving abilities.


Comparing Software Developers with ChatGPT: An Empirical Investigation

Nascimento, Nathalia, Alencar, Paulo, Cowan, Donald

arXiv.org Artificial Intelligence

The automation of particular Software Engineering (SE) tasks has transitioned from theory to reality. Numerous scholarly articles have documented the successful application of Artificial Intelligence to address issues in areas such as project management, modeling, testing, and development. A recent innovation is the introduction of ChatGPT, an ML-infused chatbot, touted as a resource proficient in generating programming code and formulating software testing strategies for developers and testers, respectively. Although there is speculation that AI-based computation can increase productivity and even substitute for software engineers in software development, there is currently a lack of empirical evidence to verify this. Moreover, despite the primary focus on enhancing the accuracy of AI systems, non-functional requirements including energy efficiency, vulnerability, fairness (i.e., human bias), and safety frequently receive insufficient attention. This paper posits that a comprehensive comparison of software engineers and AI-based solutions, considering various evaluation criteria, is pivotal in fostering human-machine collaboration, enhancing the reliability of AI-based methods, and understanding task suitability for humans or AI. Furthermore, it facilitates the effective implementation of cooperative work structures and human-in-the-loop processes. This paper conducts an empirical investigation, contrasting the performance of software engineers and AI systems, like ChatGPT, across different evaluation metrics. The empirical study includes a case of assessing ChatGPT-generated code versus code produced by developers and uploaded to LeetCode.


Humans are Still Better than ChatGPT: Case of the IEEEXtreme Competition

Koubaa, Anis, Qureshi, Basit, Ammar, Adel, Khan, Zahid, Boulila, Wadii, Ghouti, Lahouari

arXiv.org Artificial Intelligence

Since the release of ChatGPT, numerous studies have highlighted the remarkable performance of ChatGPT, which often rivals or even surpasses human capabilities in various tasks and domains. However, this paper presents a contrasting perspective by demonstrating an instance where human performance excels in typical tasks suited for ChatGPT, specifically in the domain of computer programming. We utilize the IEEEXtreme Challenge competition as a benchmark, a prestigious, annual international programming contest encompassing a wide range of problems with different complexities. To conduct a thorough evaluation, we selected and executed a diverse set of 102 challenges, drawn from five distinct IEEEXtreme editions, using three major programming languages: Python, Java, and C++. Our empirical analysis provides evidence that, contrary to popular belief, human programmers maintain a competitive edge over ChatGPT in certain aspects of problem-solving within the programming context. In fact, we found that the average score obtained by ChatGPT on the set of IEEEXtreme programming problems is 3.9 to 5.8 times lower than the average human score, depending on the programming language. This paper elaborates on these findings, offering critical insights into the limitations and potential areas of improvement for AI-based language models like ChatGPT.


Can AI Excel in Programming Like Humans?

#artificialintelligence

Artificial intelligence (AI) has come a long way since its inception, with advancements in machine learning and natural language processing allowing it to perform complex tasks. One area where AI has shown significant potential is in programming, with researchers exploring ways to teach machines to code. However, the question remains: Can AI excel in programming like humans? In this article, we will explore the possibilities and limitations of AI in programming. Before delving into whether AI can excel in programming like humans, it's essential to understand what AI in programming entails.


Will DeepMind's AlphaCode Replace Programmers? - KDnuggets

#artificialintelligence

The Alphabet subsidiary DeepMind has done it again, and this time, they are testing the boundaries of AI in software development sectors. DeepMind's AlphaCode was tested against human performance on coding challenges and ranked among the top 54% of human coders on Codeforces. This is a remarkable achievement, as it is the first of its kind. There are other code-generation machine learning models, such as OpenAI Codex, but none of them has competed directly with human programmers. A coding challenge is like solving puzzles. To solve these challenges, an individual must have an understanding of logic, math, and programming skills.


DeepMind's AlphaCode Explained: Everything You Need to Know

#artificialintelligence

Programming has long been a high-status, high-demand skill. Companies and businesses across industries depend at a very foundational level on the ability of human developers: people who write and understand the language of computers. Recently, with the advent of large language models, AI companies have begun to explore the possibilities of systems that can learn to code. OpenAI's Codex -- embedded into GitHub Copilot -- was the first notable example. Codex can read simple natural language commands and instructions and write code that matches the intention of the user. Yet, writing small programs and solving easy tasks is "far from the full complexity of real-world programming." AI models like Codex lack the problem-solving skills that most programmers rely on in their day-to-day jobs. That's the gap DeepMind wanted to fill with AlphaCode, an AI system that has been trained to "understand" natural language, design algorithms to solve problems, and then implement them in code. AlphaCode displays a unique skill set of natural language understanding and problem-solving ability, combined with the statistical power characteristic of large language models. The system was tested against human programmers on the popular competitive programming platform Codeforces. AlphaCode averaged a ranking of 54.3% across 10 contests, which makes it the first AI to reach the level of human programmers in competitive programming contests. I've studied the AlphaCode paper to understand what AlphaCode is and isn't, what these impressive results mean, what the implications are, and what the future holds for AI and human developers. I've also researched what AI experts and competitive programmers are saying about AlphaCode, so you have different independent perspectives to form your own. This article is a thorough review divided into 6 sections (and their respective subsections). I will include comments throughout the article to explore some questions, ideas, and results in more depth.


When DeepMind's 'AlphaCode' Competed Against Human Programmers

#artificialintelligence

Among at least a few programmers, this has already provoked some concern. Recently a programming student on Hacker News complained of "AlphaCode Anxiety" (as well as worries about GitHub's Copilot). "Now it feels like I'm running against a clock until the career I am working very hard for will automate itself away," the student wrote. When a blog post at CodeForces declared "The future has arrived," one worried programmer even argued that "there is a limit to what humans should automate." The programmer added pointedly that the DeepMind developers who built AlphaCode "think that they are irreplaceable, but they would be the first ones to get replaced." But the fact that AlphaCode finished in the bottom half was also greeted with a very human disparagement. "AI is such a noob," the first commenter responded.